feat: R&D codec bench framework — upstream sync, probes P5/P7, InferenceBackend, measurement model by AdaWorldAPI · Pull Request #189 · AdaWorldAPI/lance-graph

AdaWorldAPI · 2026-04-17T13:36:40Z

Summary

R&D framework for codec psychometric benchmarking. Upstream sync, probe results, InferenceBackend trait, agent tooling, measurement model.

What's on this branch (9 commits)

Upstream sync

Stale snapshot removal — deleted AdaWorldAPI-lance-graph-d9df43b/ (182 files, 3 MB). Full audit: zero content loss, our src is a strict superset. Eliminates GitHub path confusion.
Cherry-pick spark_dialect.rs from upstream PR #150 (DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests)): the ONE file upstream has that we didn't (107 LOC Spark SQL dialect + 293 LOC test).

Python reference headers

scripts/tts_inference.py and scripts/bake_hhtld_codebooks.sh now have "REFERENCE ONLY — Rust is canonical" headers pointing to the Rust equivalents.

InferenceBackend trait (`crates/thinking-engine/src/inference_backend.rs`)

Runtime-switchable dispatch across ALL codec/inference paths. Nothing killed.
Two classification axes: (full-path QJL vs leaf-only I8 hybrid vs passthrough) × (reconstruction vs signature vs hybrid grade).
7 backend structs: Passthrough, RaBitQ, Spiral, I8Hybrid, HhtlF32, Cascade, Base17Signature.
Designed for the EmbedAnything runtime-addressing pattern — switch backends without killing any path.

Probe results measured on real Qwen3-TTS-0.6B

| Probe | Tensor | Result |
|---|---|---|
| P5 TurboQuant | k_proj [2048,1024] | All 4 correction methods ρ≥0.997 at L=1; ALL collapse to ρ=0.000 by L=5. Chain kills all: variance, not bias. |
| P7 PolarQuant HIP | k_proj [1024,1024] | PolarQuant-normalized families WORSE than Base17 L1 (-9%). Stripping magnitude before clustering loses informative coupling. |
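The P7 comparison is reported as within-family nearest-neighbor recall. The following is a minimal sketch of one way to compute that metric, under assumptions: the function names are hypothetical, squared Euclidean distance is assumed (the probe may use cosine or another metric), and "recall" is taken to mean the fraction of rows whose nearest neighbor falls in the same family.

```rust
/// Squared Euclidean distance between two rows (assumed metric, not confirmed by the probe).
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Fraction of rows whose nearest neighbor (excluding the row itself)
/// carries the same family label. Hypothetical helper, O(n^2) brute force.
fn within_family_nn_recall(rows: &[Vec<f32>], family: &[usize]) -> f64 {
    assert_eq!(rows.len(), family.len(), "one family label per row");
    let n = rows.len();
    if n < 2 {
        return 0.0;
    }
    let mut hits = 0usize;
    for i in 0..n {
        // Track the closest other row seen so far as (index, distance).
        let mut best: Option<(usize, f32)> = None;
        for j in 0..n {
            if i == j {
                continue;
            }
            let d = sq_dist(&rows[i], &rows[j]);
            if best.map_or(true, |(_, bd)| d < bd) {
                best = Some((j, d));
            }
        }
        if let Some((j, _)) = best {
            if family[j] == family[i] {
                hits += 1;
            }
        }
    }
    hits as f64 / n as f64
}
```

Running this once with Base17 L1 family labels and once with PolarQuant-normalized labels on the same rows would give the two numbers being compared.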

ADK behavior monitor agent

.claude/agents/adk-behavior-monitor.md: 7 anti-patterns (AP1-AP7) codified from PRs #176 (perf(tts_rvq_e2e): AVX-512 F32x16 FMA + AMX polyfill probe; recover AudioNode bridge) through #188 (chore: remove stale upstream snapshot + port spark_dialect from upstream #150). Flags session déjà-vu.

All agents → Opus 4.7

All 29 agent cards across both repos pinned to model: opus. Zero sonnet.

Invariants doc extended (470 LOC)

New invariants:

I9 BF17 shapeshifting — same bits carry different semantics per HHTL level (float at HEEL, signed coefficient at LEAF). Explains WHY PolarQuant-only splitting hurts.

New probes specified:

P8 Cronbach's α bench — psychometric measurement model. Codec candidates as test items, internal consistency (α) discovers factor structure. Epiphany × population correlation matrix ties every session lesson to testable predictions across 6 data populations (attention k_proj, MLP gate, vocab embed, Jina v5, audio codec, BGE-M3).
P9 Mixed bit-width per HHTL level — tests whether wider HIP (finer structure) × shorter LEAF beats narrow HIP × longer LEAF at same total bit budget. 6 variants from 38 to 102 bits/row. Core question: does accuracy at the address level compound through layers enough to pay off vs brute-force leaf precision?
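P9 holds the total bit budget fixed while trading HIP width against LEAF length. A minimal sketch of enumerating such variants follows. The type and function names are hypothetical, and the widths in the usage note are illustrative only; they are not the six variants specified in the invariants doc.

```rust
/// One way to split a per-row bit budget between the HIP address and the LEAF payload.
/// Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BitSplit {
    hip_bits: u32,
    leaf_bits: u32,
}

/// For a fixed total budget, pair each candidate HIP width with the LEAF
/// width that uses up the remaining bits. Widths that leave no room for a
/// LEAF are skipped.
fn splits_at_budget(total_bits: u32, hip_widths: &[u32]) -> Vec<BitSplit> {
    hip_widths
        .iter()
        .filter(|&&h| h < total_bits)
        .map(|&h| BitSplit {
            hip_bits: h,
            leaf_bits: total_bits - h,
        })
        .collect()
}
```

For example, `splits_at_budget(70, &[4, 6, 8])` yields three variants with the same total, so any quality difference between them comes from where the bits are spent rather than from how many there are.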

Design principle

Nothing retired. Every research path coexists as an InferenceBackend variant. The bench runs all against all, Cronbach's α tells us factor structure, and deprecation is data-driven. Python is prep-only (HF download, ONNX export); Rust is the canonical inference runtime.

Test plan

cargo build --release --example polarquant_hip_probe — clean
cargo build --release --example turboquant_correction_probe — clean
P5 TurboQuant run — chain collapse measured
P7 PolarQuant HIP run — refuted (-9%)
P8 Cronbach's α bench implementation (next session — measurement model specified)
P9 resolution variants implementation (next session)
InferenceBackend impls for each existing path (next session)

Next session entry point

docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md § P8 has the full measurement model: 7 codecs × 6 populations × 9 metrics × 6 resolution variants. The epiphany × population correlation matrix maps every invariant (I1-I9) to its testable prediction per population. Start by implementing cronbach_alpha in bgz_tensor::quality, then the bench fills the matrix.
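As a starting point for that step, here is a minimal sketch of Cronbach's α using the standard formula α = k/(k−1) · (1 − Σ var(item) / var(total)). The signature is an assumption, not the one planned for bgz_tensor::quality: `items[i][s]` is taken to be the score of codec `i` (the test item) on observation `s`, and sample variance with an n−1 denominator is assumed.

```rust
/// Sample variance with an (n - 1) denominator; 0.0 for fewer than 2 values.
fn sample_variance(xs: &[f64]) -> f64 {
    let n = xs.len();
    if n < 2 {
        return 0.0;
    }
    let mean = xs.iter().sum::<f64>() / n as f64;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0)
}

/// Cronbach's alpha over `items[i][s]` = score of item `i` on observation `s`.
/// Returns None for fewer than 2 items, fewer than 2 observations,
/// ragged input, or zero variance of the summed scores.
fn cronbach_alpha(items: &[Vec<f64>]) -> Option<f64> {
    let k = items.len();
    if k < 2 {
        return None;
    }
    let n = items[0].len();
    if n < 2 || items.iter().any(|it| it.len() != n) {
        return None;
    }
    // Sum of the per-item variances.
    let item_var_sum: f64 = items.iter().map(|it| sample_variance(it)).sum();
    // Variance of the per-observation total across all items.
    let totals: Vec<f64> = (0..n)
        .map(|s| items.iter().map(|it| it[s]).sum::<f64>())
        .collect();
    let total_var = sample_variance(&totals);
    if total_var == 0.0 {
        return None;
    }
    let k = k as f64;
    Some(k / (k - 1.0) * (1.0 - item_var_sum / total_var))
}
```

Items that move together across observations push α toward 1; α near 0 (or negative) suggests the codecs are not measuring one shared factor on that population.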

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

## InferenceBackend trait (crates/thinking-engine/src/inference_backend.rs)

Runtime-switchable dispatch across all codec/inference paths. Nothing killed: every research path coexists as a backend variant.

Two key axes documented in the trait module:

Axis 1: full-path vs leaf-only quantization

- Full-path QJL/PolarQuant: entire row → JL sign+magnitude (~20 B/row)
- Leaf-only I8 hybrid: HEEL+HIP location (6b) + i8 JLQ residual (9 B/row)
- Passthrough: exact (2×n_cols B/row)

Axis 2: reconstruction-grade vs signature-grade

- Reconstruction: SafetensorsRaw, BurnFwd, CandleFwd, HhtlF32+SlotL
- Signature: RaBitQ, SpiralEncoding, CodecCascade, Base17
- Hybrid: I8Hybrid (location + JLQ leaf)

7 backend structs registered in all_backends(). EncodedState enum carries opaque per-backend state. Trait methods: encode, score, reconstruct, bytes_per_row, shared_overhead_bytes, grade.

## TurboQuant P5 results (run on Qwen3-TTS-0.6B k_proj [2048,1024])

CRITICAL FINDING: all 4 correction methods (direct i8, Fisher z, QJL corrected, TurboQuant) hit rho >= 0.997 at single-layer, but ALL collapse to rho = 0.000 by layer 5 in a 33-layer chain.

- Single layer: Fisher z best (rho=0.999), all >= 0.997
- Chain L=5: ALL 0.000
- Drift/layer: QJL 6x lower bias than direct i8 (doesn't help)

Root cause: variance, not bias. Repeated multiplication of quantized score matrices amplifies noise beyond recovery. QJL bias correction is correct but irrelevant when variance dominates.

Implications:

- Path B (cascade inference through 33 layers) NOT VIABLE as chained score multiplication
- Single-layer cascade IS viable (rho >= 0.997)
- I8 hybrid (HEEL+HIP + JLQ leaf) does f32 reconstruction, not chained scoring: a different quality model, not refuted by this
- Hybrid strategy: cascade per-layer, f32 GEMM between layers

P5 status updated in docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md: MEASURED, chain collapses, single-layer passes.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
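To make the single-layer measurement concrete, here is a minimal sketch of the "direct i8" baseline scored by correlation against the exact values. It rests on assumptions: scores are taken to lie in [-1, 1], symmetric scaling by 127 is assumed, and rho is computed as Pearson correlation (the probe may report Spearman). The function names are hypothetical, and the sketch covers only the L=1 case, not the chained collapse.

```rust
/// Pearson correlation between two equal-length slices.
/// Returns None on length mismatch, fewer than 2 values, or zero variance.
fn pearson(x: &[f32], y: &[f32]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().map(|&v| v as f64).sum::<f64>() / n;
    let my = y.iter().map(|&v| v as f64).sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in x.iter().zip(y) {
        let (da, db) = (a as f64 - mx, b as f64 - my);
        sxy += da * db;
        sxx += da * da;
        syy += db * db;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Symmetric i8 quantization of a score assumed to lie in [-1, 1].
fn quantize_i8(s: f32) -> i8 {
    (s.clamp(-1.0, 1.0) * 127.0).round() as i8
}

/// Inverse of `quantize_i8`, up to rounding error of at most 0.5 / 127.
fn dequantize_i8(q: i8) -> f32 {
    q as f32 / 127.0
}
```

Quantizing a vector of exact scores, dequantizing, and passing both to `pearson` gives a single-layer rho. The chain result above is a separate effect: it arises when quantized score matrices are multiplied layer after layer, which this sketch does not model.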

Codifies 7 anti-patterns (AP1-AP7) learned from PRs #176-#188 into an agent card that fires flags when the session repeats them:

- AP1: "225/225 feels like success" without gate 2 (#178)
- AP2: Projecting quality from docs instead of measuring (#177)
- AP3: Building new codec before benching existing ones (#184)
- AP4: Centroid-residual framing on near-orthogonal data (#177/#183)
- AP5: Python in the inference hot path
- AP6: Chained score multiplication without chain-collapse check (P5)
- AP7: Modifying ndarray without explicit permission (#176)

Invoked by adk-coordinator when pattern repetition is suspected, or by human directly. Output: list of fired flags, max 7 lines.

Also audited all 29 agent cards across both repos:

- All pin model: opus or model: sonnet (no hardcoded versions)
- opus → Opus 4.7 automatically, sonnet → Sonnet 4.6
- 3 ndarray agents on sonnet (l3-strategist, migration-tracker, product-engineer), intentional for speed-over-depth roles
- adk-coordinator missing Bash tool (by design: delegates)
- sentinel-qa missing Edit/Write (by design: audit-only)

No agent changes needed for Opus 4.7 compatibility: model: opus resolves correctly.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

## P7: PolarQuant HIP family probe, REFUTED for pure direction split

Measured on Qwen3-TTS-0.6B k_proj [2048,1024], 256 rows:

- Base17 L1 (current): 16.8% within-family NN recall (16/16 families)
- PolarQuant normalized: 7.8% within-family NN recall (16/16 families)
- Delta: -9.0% ← PolarQuant is WORSE

Root cause: stripping magnitude before clustering loses informative signal. For k_proj rows, magnitude variation correlates with NN structure: rows with similar magnitudes tend to be nearest neighbors. Base17 L1 already encodes a JOINT direction+magnitude opinion through the golden-step fold. Pure-direction families throw away half the coupling.

Insight: the "opinion as address" framing is correct, but the opinion must be JOINT direction+magnitude (like BF16's mantissa+exponent), not direction alone. This confirms the logarithmic-scale bgz17 philosophy: u8 encodes both axes simultaneously.

Status: P7 REFUTED for PolarQuant-only normalization on k_proj. Base17 L1 families are already sufficient for this tensor shape. May differ for other roles (gate, up, down); per-role probing is a follow-up.

## InferenceBackend trait (inference_backend.rs)

Runtime-switchable dispatch design. 7 backend variants documented with two classification axes:

- Axis 1: full-path QJL vs leaf-only I8 hybrid vs passthrough
- Axis 2: reconstruction-grade vs signature-grade vs hybrid

Trait: encode → EncodedState, score(i,j), reconstruct(i), grade(). Not yet wired into lib.rs (needs feature gate design for heavy deps).

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
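The trait shape described above can be sketched as follows. This is an illustration under assumptions, not the contents of inference_backend.rs: the method names, EncodedState, and all_backends() come from the commit messages, but every signature, the `Grade` enum, the dot-product score, and the choice of Reconstruction grade for Passthrough are guesses. Only the passthrough backend is sketched.

```rust
/// Second classification axis. Enum name and variants are assumed.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    Reconstruction,
    Signature,
    Hybrid,
}

/// Opaque per-backend state. Only the passthrough case is sketched here.
enum EncodedState {
    Passthrough { rows: Vec<Vec<f32>> },
}

/// Assumed signatures; the real trait may differ.
trait InferenceBackend {
    fn encode(&self, rows: &[Vec<f32>]) -> EncodedState;
    fn score(&self, state: &EncodedState, i: usize, j: usize) -> f32;
    fn reconstruct(&self, state: &EncodedState, i: usize) -> Vec<f32>;
    fn bytes_per_row(&self, n_cols: usize) -> usize;
    fn shared_overhead_bytes(&self) -> usize;
    fn grade(&self) -> Grade;
}

struct Passthrough;

impl InferenceBackend for Passthrough {
    fn encode(&self, rows: &[Vec<f32>]) -> EncodedState {
        // Exact copy; this sketch keeps f32 in memory for simplicity.
        EncodedState::Passthrough { rows: rows.to_vec() }
    }
    fn score(&self, state: &EncodedState, i: usize, j: usize) -> f32 {
        // Assumption: the score is a plain dot product between rows i and j.
        let EncodedState::Passthrough { rows } = state;
        rows[i].iter().zip(&rows[j]).map(|(a, b)| a * b).sum()
    }
    fn reconstruct(&self, state: &EncodedState, i: usize) -> Vec<f32> {
        let EncodedState::Passthrough { rows } = state;
        rows[i].clone()
    }
    fn bytes_per_row(&self, n_cols: usize) -> usize {
        // 2 × n_cols B/row, as listed for Passthrough on Axis 1.
        2 * n_cols
    }
    fn shared_overhead_bytes(&self) -> usize {
        0
    }
    fn grade(&self) -> Grade {
        // Assumption: exact storage counts as reconstruction-grade.
        Grade::Reconstruction
    }
}

/// Registry of backends; the bench would iterate over this list.
fn all_backends() -> Vec<Box<dyn InferenceBackend>> {
    vec![Box::new(Passthrough)]
}
```

Returning boxed trait objects from `all_backends()` is what makes the dispatch runtime-switchable in this sketch: a bench loop can call `encode`, `score`, and `reconstruct` on each entry without knowing the concrete type.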

## I9: BF17 shapeshifting

Same 16-17 bit wire width carries different constructs at different HHTL levels: BF17 float at HEEL (joint direction+magnitude opinion), 4-bit partition at HIP, 8×i8 PolarQuant coefficients at LEAF. The "shapeshifting" is: exponent bits at HEEL become direction bits at LEAF; mantissa bits at HEEL become magnitude bits at LEAF. Explains WHY PolarQuant-only splitting hurts (P7 result): the coupling between direction and magnitude IS the information at HEEL/HIP level.

## P8: Cronbach's α codec bench (psychometric measurement model)

Reframes the R&D bench from "horse race" to "psychometric instrument validation." Codec candidates are test items; we measure internal consistency (α) to discover factor structure.

### Epiphany × population correlation matrix

Cross-tabulates every invariant (I1-I9) and probe finding (P1-P7) against 6 data populations: attention k_proj, MLP gate, vocab embedding, Jina v5 output, audio codec embeddings, BGE-M3 output. Each cell predicts what should happen if the invariant holds on that population. The bench FILLS the cells.

### Populations chosen for cross-validation

Different distribution signatures (near-orthogonal vs unit-normalized vs vocab-sparse vs SiLU-gated vs discrete-latent) ensure the factor structure is real, not artifact of one tensor's shape.

### Metrics

9 metrics per (codec × population) cell. 4 already in bgz_tensor::quality (pearson, spearman, top_k_recall, mae/rmse). 4 NEW to implement (Cronbach's α, Cohen's κ, bias, ICC).

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

claude added 6 commits April 17, 2026 10:41

docs: add REFERENCE ONLY headers to Python/shell scripts — Rust is th…

bf77641

…e canonical path https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

docs: P9 mixed bit-width per HHTL level — resolution as bench variable

9a0adbc

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

AdaWorldAPI merged commit b9b973d into main Apr 17, 2026

AdaWorldAPI merged 6 commits into main from claude/codec-rnd-bench
